Data Mining · Machine Learning

Bike Sharing Demand Prediction

Data Mining Course · Kharazmi University · Spring 2023 · Seoul Bike Sharing Dataset

Overview

In this project, I build a demand-forecasting pipeline that combines historical rental counts with weather and time-of-day features to predict hourly bike-sharing usage. Nine machine learning and data-mining algorithms, ranging from Linear Regression and SVM to Random Forest and Gradient Boosting, are implemented and compared. Each model is evaluated on RMSE, MAE, and R² with 5-fold cross-validation, and feature importances are analyzed to guide dynamic bike-redistribution strategies.

A secondary goal of this project is to learn and teach at the same time. I started it to deepen my own understanding of these methods and to share what I learned so that it might be useful to others. All resources used throughout the project are listed at the end.

Why does this matter? In recent years, many cities and businesses have turned to bike-sharing systems to improve urban mobility. Accurate demand forecasting is critical to their success. Over-supplying bikes inflates maintenance and parking costs, while under-supplying them leads to lost revenue and poor user experience. A reliable prediction model helps operators balance supply with demand, cutting unnecessary costs and maximizing profitability.

Dataset

The Seoul Bike Sharing dataset contains 8,760 hourly observations spanning December 2017 to November 2018. Each record captures the number of bikes rented along with meteorological and temporal features. There are no missing values in the dataset, which makes it clean and ready for analysis right away.

8,760 total records · 14 features · 0 missing values · 705 average hourly rentals

Features include temperature, humidity, wind speed, visibility, dew point, solar radiation, rainfall, snowfall, season, holiday indicator, and functioning day status.

Exploratory Data Analysis

Target Variable Distribution and Box-Cox Normalization

The rental count is heavily right-skewed (skewness = 1.15), with most hours recording under 1,000 rentals and a long tail stretching up to 3,556. To handle this, I applied the Box-Cox transformation from scipy.stats, which automatically finds the optimal lambda parameter. The resulting lambda of 0.35 brought the skewness almost to zero (-0.13), which is far better than a simple log transformation that actually overshoots into negative skew (-1.83). This is a great example of why Box-Cox is often preferred over log transforms for normalizing skewed distributions.

Target distribution
Figure 1. Rented Bike Count distribution: original (left), Box-Cox transformed (center), and log-transformed (right). Box-Cox achieves near-zero skewness.
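A model trained on the Box-Cox scale must have its predictions mapped back to rental counts, so the transform needs to be invertible. A minimal sketch with toy values (the +1 shift mirrors the notebook, since Box-Cox requires strictly positive input), using `scipy.special.inv_boxcox`:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Toy counts standing in for hourly rentals (includes a zero, hence the +1 shift).
counts = np.array([0.0, 78.0, 254.0, 1500.0, 3556.0])
transformed, lam = stats.boxcox(counts + 1)   # Box-Cox needs strictly positive input

# Values on the transformed scale are inverted back to counts:
recovered = inv_boxcox(transformed, lam) - 1
print(np.allclose(recovered, counts))   # round-trip is exact up to float precision
```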

Hourly Demand Pattern

Looking at how demand changes throughout the day, we can see a clear bimodal pattern. There are two distinct peaks: one at 8 AM (morning commute) and a larger one at 6 PM (evening commute). Demand drops to its lowest point between 1 AM and 5 AM, which makes intuitive sense. The error bars show substantial variability at peak hours, meaning that while the average is high, actual demand can swing quite a bit depending on weather and whether it is a working day.

Hourly demand
Figure 2. Average bike rentals by hour of day with standard deviation bars.
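One caveat with the raw Hour feature is that it treats 23:00 and 00:00 as maximally far apart, even though demand flows continuously across midnight. A common remedy, sketched here as an optional refinement that the notebook does not use, is a sine/cosine encoding:

```python
import numpy as np

hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

# On the (sin, cos) circle, 23:00 and 00:00 become neighbors again:
gap = np.hypot(hour_sin[23] - hour_sin[0], hour_cos[23] - hour_cos[0])
print(round(gap, 3))
```

Tree-based models largely learn such wrap-around effects on their own, which is one reason the raw Hour feature still works well below.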

Seasonal Pattern

Summer is the clear winner with an average of 1,034 rentals per hour, followed by Autumn (820) and Spring (730). Winter drops dramatically to just 226 average rentals, which is less than a quarter of summer demand. This seasonal swing is something bike-sharing operators need to plan for seriously, as it affects fleet sizing, maintenance scheduling, and staffing.

Seasonal demand
Figure 3. Average bike rentals by season.

Temperature vs Rentals

Temperature is the single strongest predictor of bike rentals (Pearson r = 0.54). The scatter plot below, colored by hour of day, reveals an interesting pattern: it is not just that warmer days attract more riders, but that the combination of warm temperatures and evening hours (the purple/pink dots in the upper right) produces the highest demand spikes. This interaction effect is something tree-based models capture naturally.

Temperature scatter
Figure 4. Temperature vs Rented Bike Count, colored by hour of day.
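To see why tree-based models pick up this temperature-hour interaction with no extra feature engineering, consider a toy sketch (synthetic data, illustrative names) where demand spikes only when it is both warm and rush hour:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
warm = rng.integers(0, 2, 1000)          # 1 = warm enough to ride
rush = rng.integers(0, 2, 1000)          # 1 = commute hour
y = 1000.0 * warm * rush                 # demand spikes only on the combination
X = np.column_stack([warm, rush]).astype(float)

# A depth-2 tree isolates the (warm AND rush) cell exactly; a linear model
# without an explicit product term cannot represent it.
tree_r2 = DecisionTreeRegressor(max_depth=2).fit(X, y).score(X, y)
lin_r2 = LinearRegression().fit(X, y).score(X, y)
print(f"tree R2={tree_r2:.2f}  linear R2={lin_r2:.2f}")
```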

Distribution by Season and Holiday

The boxplots confirm the seasonal story and add an interesting detail about holidays. Working days actually show slightly higher median demand than holidays, likely because commute-related rides make up a large share of total usage. The spread on holidays is also wider, suggesting more unpredictable leisure usage.

Boxplots
Figure 5. Rental distribution by season (left) and working day vs holiday (right).

Pairwise Feature Relationships

To get the full picture of how every feature relates to every other feature, I created a 10x10 pairwise scatter matrix. Each off-diagonal cell shows a scatter plot between two features with the correlation coefficient displayed. The diagonal shows histograms. A few things jump out: Temperature and Dew Point are extremely correlated (r = 0.91), which signals multicollinearity. Humidity and Visibility have a strong negative relationship (r = -0.54). These insights help us understand the structure of our data before feeding it into models.

Pairplot
Figure 6. Pairwise scatter matrix for all numeric features with correlation coefficients.

Complete Correlation Heatmap

The full (symmetric) correlation heatmap below shows every pairwise Pearson correlation. The strongest predictors of bike rentals are Temperature (+0.54), Hour (+0.41), and Dew Point (+0.38). On the negative side, Humidity (-0.20), Snowfall (-0.14), and Rainfall (-0.12) all reduce demand. Notice the near-perfect correlation between Temperature and Dew Point, which is expected since dew point is physically tied to temperature.

Correlation heatmap
Figure 7. Complete Pearson correlation heatmap for all numeric features.

Data Preprocessing

Before training models, I applied the following preprocessing steps to clean and enrich the data:

- Removed non-functioning days (Functioning Day = No), dropping 295 of the 8,760 rows.
- Parsed the Date column and extracted Month, DayOfWeek, and a binary IsWeekend flag.
- Added an IsRushHour flag for the commute hours (7-9 AM and 5-7 PM).
- Label-encoded Seasons and binary-encoded Holiday.
- Split the data 80/20 into train and test sets, then standardized features with a StandardScaler fitted on the training set only (used by the linear, KNN, and SVR models).

The final feature set contains 15 features: the original numeric weather variables plus the engineered temporal features.

Model Training and Results

I trained nine different models to get a comprehensive comparison across linear, instance-based, and ensemble methods. All models were evaluated using 5-fold cross-validation on the training set, with final performance measured on the held-out test set.

Model                         RMSE      MAE       R²   CV-RMSE
Gradient Boosting (best)    160.73    86.53   0.9342    147.41
Random Forest               172.98    95.78   0.9238    168.98
Decision Tree               240.89   131.22   0.8522    230.12
K-Nearest Neighbors         252.95   167.79   0.8370    280.42
SVR                         256.50   165.93   0.8324    279.66
AdaBoost                    310.21   238.83   0.7549    320.63
Lasso Regression            395.04   302.19   0.6025    406.53
Ridge Regression            395.24   302.24   0.6021    406.50
Linear Regression           395.25   302.24   0.6021    406.51
Model comparison
Figure 8. Model comparison across RMSE, MAE, and R² metrics. Best model highlighted in teal.

Best Model: Gradient Boosting

Gradient Boosting came out on top with an R² of 0.9342 and the lowest RMSE of 160.73 among all models. Random Forest was a close second at R² = 0.9238. The linear models (Linear, Ridge, Lasso) all hovered around R² = 0.60, which tells us clearly that the relationship between features and demand is nonlinear. This is not surprising given the bimodal hourly pattern and the interactions between temperature and time of day.

RMSE 160.73 · MAE 86.53 · R² 0.9342 · CV-RMSE 147.41

Actual vs Predicted

The scatter plots below show how well the top three models predict actual demand. Points falling on the red dashed line represent perfect predictions. Gradient Boosting and Random Forest both track the diagonal closely, though both tend to slightly underpredict at the extreme high end (above 2,500 rentals).

Actual vs Predicted
Figure 9. Actual vs Predicted values for the top 3 models.

Residual Analysis

The residuals for Gradient Boosting are centered around zero (mean = -4.72), which is good. The distribution looks roughly normal, though there is slight heteroscedasticity: the model's errors tend to be larger for higher predicted values. This is common in count data and could potentially be improved by modeling the log-transformed target or using a different loss function.

Residuals
Figure 10. Residual plot (left) and residual distribution (right) for Gradient Boosting.
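One way to try the log-target suggestion, sketched here on synthetic data rather than the project pipeline, is sklearn's `TransformedTargetRegressor`, which fits on `log1p(y)` and automatically inverts with `expm1` at predict time:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 30, size=(500, 1))       # stand-in for temperature
# Non-negative target with multiplicative noise, mimicking count-like spread:
y = np.expm1(0.15 * X[:, 0]) * np.exp(rng.normal(0, 0.2, 500))

model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(random_state=42),
    func=np.log1p, inverse_func=np.expm1)
model.fit(X, y)                 # internally fits on log1p(y)
preds = model.predict(X)        # already back on the original count scale
```

Fitting on the log scale penalizes relative rather than absolute errors, which tends to even out the larger residuals at high predicted values.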

Feature Importance Analysis

Temperature and Hour are by far the two most important features, together accounting for over 62% of the model's predictive power. This makes sense: temperature determines whether people want to ride, and hour determines whether people need to ride (commuting patterns). Humidity, Solar Radiation, and the IsRushHour flag form a secondary tier of influence.

Feature importance
Figure 11. Feature importance from Gradient Boosting. Top features highlighted in teal.
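Impurity-based importances can be biased toward high-cardinality features, so a useful cross-check, shown here on synthetic data with illustrative column roles, is sklearn's permutation importance:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))                 # columns: "temp", "hour", "noise"
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 400)

gb = GradientBoostingRegressor(random_state=0).fit(X, y)
perm = permutation_importance(gb, X, y, n_repeats=10, random_state=0)

# Rank features by the mean score drop when each column is shuffled:
ranking = np.argsort(perm.importances_mean)[::-1]
print(ranking)   # strongest first; the pure-noise column should rank last
```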

Random Forest vs Gradient Boosting

Both ensemble methods agree almost perfectly on which features matter most. Temperature and Hour dominate in both models, and even the secondary features are ranked similarly. This consistency across different algorithms gives us confidence that these importance rankings reflect genuine patterns in the data, not artifacts of a single model's behavior.

RF vs GB importance
Figure 12. Feature importance comparison between Random Forest and Gradient Boosting.

Complete Code Notebook

All code cells with their outputs and chart results.

In [1]Import Libraries & Load Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, 
                               GradientBoostingRegressor, AdaBoostRegressor)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Load data
df = pd.read_csv('SeoulBikeData.csv', encoding='latin-1')
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print("\nFirst 5 rows:")
df.head()
Out [1]
Dataset shape: (8760, 14)
Columns: ['Date', 'Rented Bike Count', 'Hour', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)', 'Seasons', 'Holiday', 'Functioning Day']

Data types:
Date                          object
Rented Bike Count              int64
Hour                           int64
Temperature(°C)              float64
Humidity(%)                    int64
Wind speed (m/s)             float64
Visibility (10m)               int64
Dew point temperature(°C)    float64
Solar Radiation (MJ/m2)      float64
Rainfall(mm)                 float64
Snowfall (cm)                float64
Seasons                       object
Holiday                       object
Functioning Day               object
dtype: object

Missing values:
Date                         0
Rented Bike Count            0
Hour                         0
Temperature(°C)              0
Humidity(%)                  0
Wind speed (m/s)             0
Visibility (10m)             0
Dew point temperature(°C)    0
Solar Radiation (MJ/m2)      0
Rainfall(mm)                 0
Snowfall (cm)                0
Seasons                      0
Holiday                      0
Functioning Day              0
dtype: int64

First 5 rows:
         Date  Rented Bike Count  Hour  Temperature(°C)  Humidity(%)  Wind speed (m/s)  Visibility (10m)  Dew point temperature(°C)  Solar Radiation (MJ/m2)  Rainfall(mm)  Snowfall (cm) Seasons     Holiday Functioning Day
0  01/12/2017                254     0             -5.2           37               2.2              2000                      -17.6                      0.0           0.0            0.0  Winter  No Holiday             Yes
1  01/12/2017                204     1             -5.5           38               0.8              2000                      -17.6                      0.0           0.0            0.0  Winter  No Holiday             Yes
2  01/12/2017                173     2             -6.0           39               1.0              2000                      -17.7                      0.0           0.0            0.0  Winter  No Holiday             Yes
3  01/12/2017                107     3             -6.2           40               0.9              2000                      -17.6                      0.0           0.0            0.0  Winter  No Holiday             Yes
4  01/12/2017                 78     4             -6.0           36               2.3              2000                      -18.6                      0.0           0.0            0.0  Winter  No Holiday             Yes
In [2]Descriptive Statistics
# Descriptive statistics
print("Descriptive Statistics:")
print(df.describe().round(2).to_string())
print(f"\nUnique Seasons: {df['Seasons'].unique().tolist()}")
print(f"Unique Holiday: {df['Holiday'].unique().tolist()}")
print(f"Unique Functioning Day: {df['Functioning Day'].unique().tolist()}")
print(f"Date range: {df['Date'].iloc[0]} to {df['Date'].iloc[-1]}")
Out [2]
Descriptive Statistics:
       Rented Bike Count     Hour  Temperature(°C)  Humidity(%)  Wind speed (m/s)  Visibility (10m)  Dew point temperature(°C)  Solar Radiation (MJ/m2)  Rainfall(mm)  Snowfall (cm)
count            8760.00  8760.00          8760.00      8760.00           8760.00           8760.00                    8760.00                  8760.00       8760.00        8760.00
mean              704.60    11.50            12.88        58.23              1.72           1436.83                       4.07                     0.57          0.15           0.08
std               645.00     6.92            11.94        20.36              1.04            608.30                      13.06                     0.87          1.13           0.44
min                 0.00     0.00           -17.80         0.00              0.00             27.00                     -30.60                     0.00          0.00           0.00
25%               191.00     5.75             3.50        42.00              0.90            940.00                      -4.70                     0.00          0.00           0.00
50%               504.50    11.50            13.70        57.00              1.50           1698.00                       5.10                     0.01          0.00           0.00
75%              1065.25    17.25            22.50        74.00              2.30           2000.00                      14.80                     0.93          0.00           0.00
max              3556.00    23.00            39.40        98.00              7.40           2000.00                      27.20                     3.52         35.00           8.80

Unique Seasons: ['Winter', 'Spring', 'Summer', 'Autumn']
Unique Holiday: ['No Holiday', 'Holiday']
Unique Functioning Day: ['Yes', 'No']
Date range: 01/12/2017 to 30/11/2018
In [3]Target Distribution & Box-Cox Normalization
# Target Distribution + Box-Cox Normalization
y_raw = df['Rented Bike Count']
y_positive = y_raw + 1  # Box-Cox requires strictly positive values
y_boxcox, bc_lambda = stats.boxcox(y_positive)

print(f"Box-Cox optimal lambda: {bc_lambda:.4f}")
print(f"Original  -- Skewness: {y_raw.skew():.4f}, Kurtosis: {y_raw.kurtosis():.4f}")
print(f"Box-Cox   -- Skewness: {pd.Series(y_boxcox).skew():.4f}, Kurtosis: {pd.Series(y_boxcox).kurtosis():.4f}")
print(f"Log1p     -- Skewness: {np.log1p(y_raw).skew():.4f}, Kurtosis: {np.log1p(y_raw).kurtosis():.4f}")

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
axes[0].hist(y_raw, bins=50, color='#64ffda', alpha=0.8)
axes[0].set_title('Original Distribution')
axes[0].axvline(y_raw.mean(), color='red', linestyle='--', label=f'Mean: {y_raw.mean():.0f}')
axes[0].axvline(y_raw.median(), color='orange', linestyle='--', label=f'Median: {y_raw.median():.0f}')
axes[0].legend()

axes[1].hist(y_boxcox, bins=50, color='#e07070', alpha=0.8)
axes[1].set_title(f'Box-Cox Transformed (lambda={bc_lambda:.2f})')

axes[2].hist(np.log1p(y_raw), bins=50, color='#e5c07b', alpha=0.8)
axes[2].set_title('Log(1+x) Transformed')
plt.tight_layout()
plt.show()
Out [3]
Box-Cox optimal lambda: 0.3495
Original  -- Skewness: 1.1534, Kurtosis: 0.8534
Box-Cox   -- Skewness: -0.1274, Kurtosis: -0.5611
Log1p     -- Skewness: -1.8322, Kurtosis: 4.3387
Out [3]
In [4]Hourly Demand Pattern
# Average bike rentals by hour
hourly = df.groupby('Hour')['Rented Bike Count'].agg(['mean', 'std', 'median'])
print("Hourly Statistics:")
print(hourly.round(1).to_string())

fig, ax = plt.subplots(figsize=(12, 5))
ax.bar(hourly.index, hourly['mean'], color='#64ffda', alpha=0.85)
ax.errorbar(hourly.index, hourly['mean'], yerr=hourly['std'], fmt='none', 
            ecolor='gray', alpha=0.4, capsize=2)
ax.set_title('Average Bike Rentals by Hour (with Std Dev)')
ax.set_xlabel('Hour of Day')
ax.set_ylabel('Average Rentals')
ax.set_xticks(range(0, 24))
plt.tight_layout()
plt.show()
Out [4]
Hourly Statistics:
        mean     std  median
Hour                        
0      541.5   364.6   513.0
1      426.2   285.5   401.0
2      301.6   210.1   265.0
3      203.3   143.2   176.0
4      132.6    90.3   119.0
5      139.1    95.5   129.0
6      287.6   222.8   232.0
7      606.0   482.3   426.0
8     1015.7   761.6   728.0
9      646.0   398.6   680.0
10     527.8   323.0   581.0
11     600.9   362.0   624.0
12     699.4   430.7   709.0
13     733.2   457.7   727.0
14     758.8   488.9   733.0
15     829.2   546.5   785.0
16     930.6   618.0   911.0
17    1138.5   748.9  1184.0
18    1502.9  1029.3  1548.0
19    1195.1   857.4  1224.0
20    1069.0   793.9  1062.0
21    1031.4   753.6  1046.0
22     922.8   660.8   949.0
23     671.1   478.8   656.0
Out [4]
In [5]Seasonal Pattern
# Seasonal pattern
season_order = ['Spring', 'Summer', 'Autumn', 'Winter']
seasonal = df.groupby('Seasons')['Rented Bike Count'].agg(['mean', 'std', 'count']).reindex(season_order)
print("Seasonal Statistics:")
print(seasonal.round(1).to_string())

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(seasonal.index, seasonal['mean'], 
              color=['#64ffda', '#e07070', '#e5c07b', '#6a9a8a'], alpha=0.85)
ax.set_title('Average Bike Rentals by Season')
ax.set_ylabel('Average Rentals')
for bar, val in zip(bars, seasonal['mean']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 15, 
            f'{val:.0f}', ha='center')
plt.tight_layout()
plt.show()
Out [5]
Seasonal Statistics:
           mean    std  count
Seasons                      
Spring    730.0  621.5   2208
Summer   1034.1  690.2   2208
Autumn    819.6  651.1   2184
Winter    225.5  150.4   2160
Out [5]
In [6]Temperature vs Rentals
# Temperature vs Rentals scatter
corr_temp = df['Temperature(°C)'].corr(df['Rented Bike Count'])
print(f"Pearson correlation (Temperature vs Rentals): {corr_temp:.4f}")

fig, ax = plt.subplots(figsize=(10, 6))
scatter = ax.scatter(df['Temperature(°C)'], df['Rented Bike Count'], 
                     c=df['Hour'], cmap='cool', alpha=0.3, s=8)
plt.colorbar(scatter, ax=ax, label='Hour of Day')
ax.set_title('Temperature vs Bike Rentals (colored by Hour)')
ax.set_xlabel('Temperature (C)')
ax.set_ylabel('Rented Bike Count')
plt.tight_layout()
plt.show()
Out [6]
Pearson correlation (Temperature vs Rentals): 0.5386
Out [6]
In [7]Boxplots: Season & Holiday
# Boxplots: Season & Holiday
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
season_data = [df[df['Seasons'] == s]['Rented Bike Count'].values for s in season_order]
axes[0].boxplot(season_data, labels=season_order, patch_artist=True,
                boxprops=dict(facecolor='#64ffda', alpha=0.6))
axes[0].set_title('Rental Distribution by Season')

holiday_data = [df[df['Holiday'] == h]['Rented Bike Count'].values 
                for h in ['No Holiday', 'Holiday']]
axes[1].boxplot(holiday_data, labels=['Working Day', 'Holiday'], patch_artist=True,
                boxprops=dict(facecolor='#e07070', alpha=0.6))
axes[1].set_title('Rental Distribution: Working Day vs Holiday')
plt.tight_layout()
plt.show()
Out [7]
In [8]Pairwise Feature Relationships (All Features)
# Full Pairwise Scatter Plot (all numeric features)
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric features for pairplot ({len(numeric_cols)}): {numeric_cols}")

corr_matrix = df[numeric_cols].corr()
print(f"\nFull Pairwise Correlation Matrix:")
print(corr_matrix.round(4).to_string())

# Pairplot with correlation values
sample = df.sample(n=1500, random_state=42)[numeric_cols]
n = len(numeric_cols)
fig, axes = plt.subplots(n, n, figsize=(28, 28))
for i in range(n):
    for j in range(n):
        ax = axes[i][j]
        if i == j:
            ax.hist(sample[numeric_cols[i]], bins=25, color='#64ffda', alpha=0.7)
        else:
            ax.scatter(sample[numeric_cols[j]], sample[numeric_cols[i]], s=1.5, alpha=0.35)
            r = corr_matrix.iloc[i, j]
            ax.text(0.95, 0.95, f'{r:.2f}', transform=ax.transAxes, 
                    ha='right', va='top', fontsize=6, fontweight='bold')
        if i == n - 1: ax.set_xlabel(numeric_cols[j], fontsize=5, rotation=45, ha='right')
        if j == 0: ax.set_ylabel(numeric_cols[i], fontsize=5)
fig.suptitle('Pairwise Feature Relationships', fontsize=16)
plt.tight_layout()
plt.show()
Out [8]
Numeric features for pairplot (10): ['Rented Bike Count', 'Hour', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']

Full Pairwise Correlation Matrix:
                           Rented Bike Count    Hour  Temperature(°C)  Humidity(%)  Wind speed (m/s)  Visibility (10m)  Dew point temperature(°C)  Solar Radiation (MJ/m2)  Rainfall(mm)  Snowfall (cm)
Rented Bike Count                     1.0000  0.4103           0.5386      -0.1998            0.1211            0.1993                     0.3798                   0.2618       -0.1231        -0.1418
Hour                                  0.4103  1.0000           0.1241      -0.2416            0.2852            0.0988                     0.0031                   0.1451        0.0087        -0.0215
Temperature(°C)                       0.5386  0.1241           1.0000       0.1594           -0.0363            0.0348                     0.9128                   0.3535        0.0503        -0.2184
Humidity(%)                          -0.1998 -0.2416           0.1594       1.0000           -0.3367           -0.5431                     0.5369                  -0.4619        0.2364         0.1082
Wind speed (m/s)                      0.1211  0.2852          -0.0363      -0.3367            1.0000            0.1715                    -0.1765                   0.3323       -0.0197        -0.0036
Visibility (10m)                      0.1993  0.0988           0.0348      -0.5431            0.1715            1.0000                    -0.1766                   0.1497       -0.1676        -0.1217
Dew point temperature(°C)             0.3798  0.0031           0.9128       0.5369           -0.1765           -0.1766                     1.0000                   0.0944        0.1256        -0.1509
Solar Radiation (MJ/m2)               0.2618  0.1451           0.3535      -0.4619            0.3323            0.1497                     0.0944                   1.0000       -0.0743        -0.0723
Rainfall(mm)                         -0.1231  0.0087           0.0503       0.2364           -0.0197           -0.1676                     0.1256                  -0.0743        1.0000         0.0085
Snowfall (cm)                        -0.1418 -0.0215          -0.2184       0.1082           -0.0036           -0.1217                    -0.1509                  -0.0723        0.0085         1.0000
Out [8]
In [9]Complete Correlation Heatmap
# Complete Correlation Heatmap (full matrix, no mask)
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
corr = df[numeric_cols].corr()

target_corr = corr['Rented Bike Count'].drop('Rented Bike Count').sort_values(key=abs, ascending=False)
print("Correlations with Rented Bike Count (sorted by |r|):")
for feat, r in target_corr.items():
    bar = chr(9608) * int(abs(r) * 30)
    sign = '+' if r > 0 else '-'
    print(f"  {feat:35s}  {sign}{abs(r):.4f}  {bar}")

fig, ax = plt.subplots(figsize=(14, 11))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdYlGn', center=0,
            ax=ax, square=True, linewidths=0.5, vmin=-1, vmax=1,
            cbar_kws={'shrink': 0.8}, annot_kws={'size': 8})
ax.set_title('Complete Correlation Heatmap (All Features)')
plt.tight_layout()
plt.show()
Out [9]
Correlations with Rented Bike Count (sorted by |r|):
  Temperature(°C)                      +0.5386  ████████████████
  Hour                                 +0.4103  ████████████
  Dew point temperature(°C)            +0.3798  ███████████
  Solar Radiation (MJ/m2)              +0.2618  ███████
  Humidity(%)                          -0.1998  █████
  Visibility (10m)                     +0.1993  █████
  Snowfall (cm)                        -0.1418  ████
  Rainfall(mm)                         -0.1231  ███
  Wind speed (m/s)                     +0.1211  ███
Out [9]
In [10]Data Preprocessing & Feature Engineering
# Data Preprocessing & Feature Engineering
df_model = df[df['Functioning Day'] == 'Yes'].copy()
print(f"After filtering Functioning Day = Yes: {len(df_model)} rows "
      f"(removed {len(df) - len(df_model)})")

df_model['Date'] = pd.to_datetime(df_model['Date'], format='%d/%m/%Y')
df_model['Month'] = df_model['Date'].dt.month
df_model['DayOfWeek'] = df_model['Date'].dt.dayofweek
df_model['IsWeekend'] = (df_model['DayOfWeek'] >= 5).astype(int)
df_model['IsRushHour'] = df_model['Hour'].apply(
    lambda x: 1 if x in [7, 8, 9, 17, 18, 19] else 0)

le_season = LabelEncoder()
df_model['Seasons_enc'] = le_season.fit_transform(df_model['Seasons'])
print(f"Season encoding: {dict(zip(le_season.classes_, le_season.transform(le_season.classes_)))}")
df_model['Holiday_enc'] = (df_model['Holiday'] == 'Holiday').astype(int)

feature_cols = [
    'Hour', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)',
    'Visibility (10m)', 'Dew point temperature(°C)',
    'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)',
    'Seasons_enc', 'Holiday_enc', 'Month', 'DayOfWeek', 'IsWeekend', 'IsRushHour'
]
X = df_model[feature_cols]
y = df_model['Rented Bike Count']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Features ({len(feature_cols)}): {feature_cols}")
print(f"Train set: {len(X_train)} samples")
print(f"Test set:  {len(X_test)} samples")
print(f"\nScaler means: {dict(zip(feature_cols, scaler.mean_.round(2)))}")
Out [10]
After filtering Functioning Day = Yes: 8465 rows (removed 295)
Season encoding: {'Autumn': np.int64(0), 'Spring': np.int64(1), 'Summer': np.int64(2), 'Winter': np.int64(3)}

Features (15): ['Hour', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)', 'Seasons_enc', 'Holiday_enc', 'Month', 'DayOfWeek', 'IsWeekend', 'IsRushHour']
Train set: 6772 samples
Test set:  1693 samples

Scaler means: {'Hour': np.float64(11.48), 'Temperature(°C)': np.float64(12.79), 'Humidity(%)': np.float64(58.32), 'Wind speed (m/s)': np.float64(1.73), 'Visibility (10m)': np.float64(1429.69), 'Dew point temperature(°C)': np.float64(4.0), 'Solar Radiation (MJ/m2)': np.float64(0.57), 'Rainfall(mm)': np.float64(0.15), 'Snowfall (cm)': np.float64(0.08), 'Seasons_enc': np.float64(1.54), 'Holiday_enc': np.float64(0.05), 'Month': np.float64(6.43), 'DayOfWeek': np.float64(3.02), 'IsWeekend': np.float64(0.29), 'IsRushHour': np.float64(0.25)}
In [11]Train & Evaluate All Models
# Train & Evaluate All Models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=1.0),
    'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors=5),
    'Decision Tree': DecisionTreeRegressor(max_depth=12, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=200, max_depth=15, 
                                           random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=300, max_depth=6, 
                                                    learning_rate=0.1, random_state=42),
    'AdaBoost': AdaBoostRegressor(n_estimators=150, learning_rate=0.1, random_state=42),
    'SVR': SVR(kernel='rbf', C=100, epsilon=0.1),
}
scale_models = {'Linear Regression', 'Ridge Regression', 'Lasso Regression',
                'K-Nearest Neighbors', 'SVR'}

results = {}
print(f"{'Model':<25s}{'RMSE':>8s}{'MAE':>9s}{'R²':>9s}{'CV-RMSE':>10s}")
print('-' * 62)
for name, model in models.items():
    if name in scale_models:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        cv = cross_val_score(model, X_train_scaled, y_train, cv=5, 
                            scoring='neg_mean_squared_error')
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        cv = cross_val_score(model, X_train.values, y_train.values, cv=5,
                            scoring='neg_mean_squared_error')

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    cv_rmse = np.sqrt(-cv.mean())

    results[name] = {'RMSE': rmse, 'MAE': mae, 'R2': r2, 'CV_RMSE': cv_rmse}
    print(f"{name:<25s}{rmse:>8.2f}{mae:>9.2f}{r2:>9.4f}{cv_rmse:>10.2f}")

best = min(results, key=lambda k: results[k]['RMSE'])
print(f"\n>>> Best model: {best} (RMSE={results[best]['RMSE']:.2f}, "
      f"R²={results[best]['R2']:.4f})")
Out [11]
Model                         RMSE      MAE       R²   CV-RMSE
--------------------------------------------------------------
Linear Regression           395.25   302.24   0.6021    406.51
Ridge Regression            395.24   302.24   0.6021    406.50
Lasso Regression            395.04   302.19   0.6025    406.53
K-Nearest Neighbors         252.95   167.79   0.8370    280.42
Decision Tree               240.89   131.22   0.8522    230.12
Random Forest               172.98    95.78   0.9238    168.98
Gradient Boosting           160.73    86.53   0.9342    147.41
AdaBoost                    310.21   238.83   0.7549    320.63
SVR                         256.50   165.93   0.8324    279.66

>>> Best model: Gradient Boosting (RMSE=160.73, R²=0.9342)
In [12]Model Comparison Charts
# Model Comparison Charts
model_names = list(results.keys())
rmse_vals = [results[m]['RMSE'] for m in model_names]
mae_vals = [results[m]['MAE'] for m in model_names]
r2_vals = [results[m]['R2'] for m in model_names]

fig, axes = plt.subplots(1, 3, figsize=(18, 6))
for ax, vals, title, best_fn in [
    (axes[0], rmse_vals, 'RMSE (lower is better)', np.argmin),
    (axes[1], mae_vals, 'MAE (lower is better)', np.argmin),
    (axes[2], r2_vals, 'R2 Score (higher is better)', np.argmax),
]:
    sorted_idx = np.argsort(vals) if 'lower' in title else np.argsort(vals)[::-1]
    best_i = best_fn(vals)
    colors = ['#64ffda' if i == best_i else '#4a4a5a' for i in range(len(model_names))]
    ax.barh([model_names[i] for i in sorted_idx], [vals[i] for i in sorted_idx],
            color=[colors[i] for i in sorted_idx], alpha=0.85)
    ax.set_title(title)
plt.tight_layout()
plt.show()
Out [12]
In [13]Actual vs Predicted (Top 3)
# Actual vs Predicted -- Top 3 Models
top3 = sorted(results, key=lambda k: results[k]['RMSE'])[:3]
fig, axes = plt.subplots(1, 3, figsize=(18, 5.5))
for ax, name in zip(axes, top3):
    y_pred = models[name].predict(X_test_scaled if name in scale_models else X_test)
    ax.scatter(y_test, y_pred, alpha=0.25, s=8, color='#64ffda')
    max_val = max(y_test.max(), y_pred.max())
    ax.plot([0, max_val], [0, max_val], '--', color='red', linewidth=1.5, label='Perfect')
    ax.set_title(f'{name}\nR2 = {results[name]["R2"]:.4f}')
    ax.set_xlabel('Actual'); ax.set_ylabel('Predicted')
    ax.legend(fontsize=8)
plt.tight_layout()
plt.show()
Out [13]
In [14]Feature Importance Analysis
# Feature Importance -- Gradient Boosting
gb = models['Gradient Boosting']
importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': gb.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance (Gradient Boosting):")
for _, row in importance.iterrows():
    bar = chr(9608) * int(row['importance'] * 60)  # chr(9608) is the '█' full-block character
    print(f"  {row['feature']:35s}  {row['importance']:.4f}  {bar}")

importance_sorted = importance.sort_values('importance', ascending=True)
fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(importance_sorted['feature'], importance_sorted['importance'], 
        color=['#64ffda' if v >= importance['importance'].quantile(0.75) 
               else '#4a4a5a' for v in importance_sorted['importance'].values])
ax.set_title('Feature Importance (Gradient Boosting)')
ax.set_xlabel('Importance')
plt.tight_layout()
plt.show()
Out [14]
Feature Importance (Gradient Boosting):
  Temperature(°C)                      0.3379  ████████████████████
  Hour                                 0.2866  █████████████████
  Humidity(%)                          0.0823  ████
  Solar Radiation (MJ/m2)              0.0729  ████
  Rainfall(mm)                         0.0452  ██
  IsRushHour                           0.0352  ██
  Seasons_enc                          0.0321  █
  DayOfWeek                            0.0253  █
  IsWeekend                            0.0238  █
  Dew point temperature(°C)            0.0237  █
  Month                                0.0166  
  Visibility (10m)                     0.0068  
  Holiday_enc                          0.0067  
  Wind speed (m/s)                     0.0045  
  Snowfall (cm)                        0.0005
Out [14]
In [15]Residual Analysis
# Residual Analysis -- Best Model
best_name = min(results, key=lambda k: results[k]['RMSE'])
# Use the scaled features when the best model requires them, matching the earlier cells
best_pred = models[best_name].predict(X_test_scaled if best_name in scale_models else X_test)
residuals = y_test.values - best_pred

print(f"Residual Analysis for {best_name}:")
print(f"  Mean residual: {residuals.mean():.2f}")
print(f"  Std residual:  {residuals.std():.2f}")
print(f"  Min residual:  {residuals.min():.2f}")
print(f"  Max residual:  {residuals.max():.2f}")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(best_pred, residuals, alpha=0.25, s=8, color='#64ffda')
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_title(f'Residual Plot ({best_name})')
axes[0].set_xlabel('Predicted'); axes[0].set_ylabel('Residual')

axes[1].hist(residuals, bins=50, color='#64ffda', alpha=0.8)
axes[1].axvline(x=0, color='red', linestyle='--')
axes[1].set_title('Residual Distribution')
plt.tight_layout()
plt.show()
Out [15]
Residual Analysis for Gradient Boosting:
  Mean residual: -4.72
  Std residual:  160.66
  Min residual:  -2309.85
  Max residual:  1331.65
Out [15]
In [16]RF vs GB Importance Comparison
# Feature Importance Comparison: Random Forest vs Gradient Boosting
rf = models['Random Forest']
comp = pd.DataFrame({
    'feature': feature_cols,
    'Random Forest': rf.feature_importances_,
    'Gradient Boosting': gb.feature_importances_
}).sort_values('Gradient Boosting', ascending=True)

print("Feature Importance Comparison (RF vs GB):")
print(comp.sort_values('Gradient Boosting', ascending=False).to_string(index=False))

fig, ax = plt.subplots(figsize=(10, 7))
y_pos = np.arange(len(comp))
ax.barh(y_pos - 0.2, comp['Random Forest'].values, 0.4, label='Random Forest', color='#64ffda')
ax.barh(y_pos + 0.2, comp['Gradient Boosting'].values, 0.4, label='Gradient Boosting', color='#e07070')
ax.set_yticks(y_pos)
ax.set_yticklabels(comp['feature'].values)
ax.set_title('Feature Importance: RF vs Gradient Boosting')
ax.set_xlabel('Importance')
ax.legend()
plt.tight_layout()
plt.show()
Out [16]
Feature Importance Comparison (RF vs GB):
                  feature  Random Forest  Gradient Boosting
          Temperature(°C)       0.344134           0.337856
                     Hour       0.276564           0.286580
              Humidity(%)       0.084948           0.082297
  Solar Radiation (MJ/m2)       0.084492           0.072897
             Rainfall(mm)       0.036274           0.045226
               IsRushHour       0.037281           0.035177
              Seasons_enc       0.031611           0.032062
                DayOfWeek       0.025238           0.025285
                IsWeekend       0.018051           0.023807
Dew point temperature(°C)       0.022503           0.023723
                    Month       0.013933           0.016595
         Visibility (10m)       0.009160           0.006811
              Holiday_enc       0.005820           0.006729
         Wind speed (m/s)       0.009347           0.004477
            Snowfall (cm)       0.000646           0.000479
Out [16]

Conclusions

Gradient Boosting outperformed all other models with an R² of 0.9342, followed closely by Random Forest (R² = 0.9238). Linear models performed poorly (R² ≈ 0.60), confirming that the relationship between the features and demand is highly nonlinear. Temperature and Hour of Day are the dominant predictors, together accounting for over 62% of the total feature importance.
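To make the nonlinearity point concrete, here is a minimal self-contained sketch on toy data (not the Seoul dataset, and with hypothetical parameters): demand peaks at mild temperatures and falls off at the extremes, so a straight line cannot fit it while a tree ensemble can.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Toy relationship: demand peaks around 22°C and drops at the extremes
rng = np.random.default_rng(0)
temp = rng.uniform(-10, 35, size=2000)
demand = 800 * np.exp(-((temp - 22) ** 2) / 150) + rng.normal(0, 40, size=2000)

X = temp.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, demand, random_state=0)

lin_r2 = r2_score(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
gb_r2 = r2_score(y_te, GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr).predict(X_te))
print(f"Linear R2: {lin_r2:.2f}   Gradient Boosting R2: {gb_r2:.2f}")
```

The gap between the two scores mirrors the gap seen in the comparison table above, though the exact numbers here depend entirely on the synthetic data.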

The Box-Cox transformation, with an optimal lambda of 0.35, effectively normalized the right-skewed target distribution, reducing skewness from 1.15 to -0.13. The pairwise correlation analysis also revealed strong multicollinearity between Temperature and Dew Point (r = 0.91); this matters when interpreting linear-model coefficients, but tree-based methods handle it gracefully.
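As a quick illustration of the transformation step, here is a minimal sketch using SciPy's `boxcox` on synthetic right-skewed data (a stand-in for the actual rental counts, so the fitted lambda will differ from the project's 0.35):

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "rental counts"; Box-Cox requires strictly positive values
rng = np.random.default_rng(42)
counts = rng.gamma(shape=2.0, scale=350.0, size=8760) + 1

# stats.boxcox fits lambda by maximum likelihood and returns the transformed values
transformed, fitted_lambda = stats.boxcox(counts)

print(f"fitted lambda:   {fitted_lambda:.2f}")
print(f"skewness before: {stats.skew(counts):+.2f}")
print(f"skewness after:  {stats.skew(transformed):+.2f}")
```

On the real dataset the same call yields the lambda of 0.35 and the skewness drop from 1.15 to -0.13 reported above; the numbers printed here reflect only the synthetic sample.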

These results suggest that bike-sharing operators should prioritize temperature forecasts and time-of-day scheduling when planning bike redistribution. The IsRushHour and Humidity features provide additional predictive value for fine-tuning supply during peak periods.

Tools Used

Python · Scikit-learn · Pandas · NumPy · Matplotlib · Seaborn · SciPy (Box-Cox) · Gradient Boosting · Random Forest · SVR · Cross-Validation